On the resemblance and containment of documents

Author

  • Andrei Z. Broder
Abstract

Given two documents A and B we define two mathematical notions: their resemblance r(A,B) and their containment c(A,B) that seem to capture well the informal notions of “roughly the same” and “roughly contained.” The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. Furthermore, the resemblance can be evaluated using a fixed size sample for each document. This paper discusses the mathematical properties of these measures and the efficient implementation of the sampling process using Rabin fingerprints.
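The abstract does not spell out the formulas, so the following is only a minimal sketch under stated assumptions. It assumes each document is reduced to its set of contiguous w-word sequences ("shingles"), that resemblance is the size of the intersection of the two sets divided by the size of their union, and that containment of A in B is the size of the intersection divided by the size of A's set. The fixed-size sample is taken as the s smallest fingerprints of each document. All function names and the values of w and s are hypothetical, and a truncated BLAKE2b hash stands in for Rabin fingerprints, which are not implemented here.

```python
import hashlib


def shingles(text, w=4):
    """Return the set of contiguous w-word sequences in text (assumed tokenization: whitespace)."""
    tokens = text.lower().split()
    if len(tokens) < w:
        # A document shorter than w words yields one short shingle, or none if empty.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=4):
    """Exact r(A, B): intersection size over union size of the shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def containment(a, b, w=4):
    """Exact c(A, B): fraction of A's shingles that also occur in B."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa) if sa else 1.0


def fingerprint(shingle):
    """64-bit fingerprint of a shingle. Assumption: BLAKE2b replaces Rabin fingerprints."""
    data = " ".join(shingle).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sketch(text, w=4, s=200):
    """Fixed-size sample of a document: its s smallest distinct fingerprints."""
    return sorted({fingerprint(sh) for sh in shingles(text, w)})[:s]


def estimate_resemblance(sketch_a, sketch_b, s=200):
    """Estimate r(A, B) from two sketches that were computed independently."""
    # Take the s smallest fingerprints of the combined sketches, then count
    # how many of them appear in both documents' sketches.
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 1.0
    both = set(sketch_a) & set(sketch_b)
    return sum(1 for f in merged if f in both) / len(merged)
```

With w=4, the texts "a rose is a rose is a rose" and "a rose is a rose" have three and two distinct shingles respectively, so this sketch gives a resemblance of 2/3, while the containment of the shorter text in the longer one is 1.0. Because each sketch depends only on its own document, sketches can be computed once, stored, and compared later without revisiting the full text.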


Similar resources

Expansion Factors of Extremism on the Continent of Africa and its Containment Strategies

In various parts of the world, the continent of Africa is struggling with the phenomenon of terrorism and extremism more than other regions. This continent, for various reasons, including the various weaknesses of software and hardware, including economic backwardness, political instabilities, social inequalities, and foreign interventions that find their interests only in insecurity and instab...


A Rewrite Approach for Pattern Containment - Application to Query Evaluation on Compressed Documents

In this paper we introduce an approach for handling the containment problem for the fragment XP(/,//,[ ],∗) of XPath. Using rewriting techniques, we define a necessary and sufficient condition for pattern containment. This rewrite view is then adapted to query evaluation on XML documents, and remains valid even if the documents are given in a compressed form, as DAGs.


Beyond Self Containment: On the Politics of Culture and Identity in a Glocal Society

As a result of the epiphany of giant multinational media conglomerates, transnational trade networks and the politics of globalization, it is tempting to believe that individual and national identities have morphed. This article argues that such homogenization in relation to individuation is tedious to accept. It draws from theories of symbolic interactionism, social psychology, Foucauldian, an...


IN SITU REFRACTORIES USED IN THE CONTAINMENT OF MOLTEN IRON AND STEEL STRATEGIES FOR THEIR DEVELOPMENT

In situ refractories are defined as brick or unshaped products, which react internally or with furnace atmospheres and/or slag components so as to be enhanced in their performance. Examples of such products are discussed with emphasis on those that are currently employed and are being developed for the melting of iron and steel. Some strategies for the development of future in situ products are...


A Rewrite Approach for Pattern Containment

In this paper we introduce an approach for handling the containment problem for the fragment XP(/,//,[ ],∗) of XPath. Using rewriting techniques, we define a necessary and sufficient condition for pattern containment. This rewrite view is then adapted to query evaluation on XML documents, and remains valid even if the documents are given in a compressed form, as DAGs.




Publication date: 1997